Objective: To tabulate the number of full-length reads obtained per gene from Isoseq and order genes from high to low, for comparison with RNAseq data for exact sample

Rationale: To evaluate whether Isoseq output comparable to RNAseq output

Analysis: 1. Downloaded raw subread.bam file from Sequel output 2. CCS and Isoseq3 command line (Lima, Cluster, Polish) 3. Mapped to mouse genome using GMAP 4. Tofu Cupcake 5. Sqanti for isoform characterisation

source("/gpfs/mrc0/projects/Research_Project-MRC148213/sl693/Scripts/IsoSeq3_Tg4510/RNASeqvsIsoSeq.R")

Step 1) IsoSeq Preparation: Annotate2Abundance

Define function for Importing and Merging SQANTI classification file and TOFU abundance file

Input: Sqanti_Filter Classification Output file * All details of HQ-unique isoforms classified by assigning PacBio output gene Cluster ID to mouse gene name

Input: ToFU Abundance Output file * Quantification of number of Full_Length per PacBio_ID

Output: Merged txt file by PacBio ID * Merged txt file has the gene name by which the isoform belongs to (as identified by SQANTI) and the quantification of FL_counts (as quantified in TOFU) by PacBio ID

SQANTI_input("K17")
## [1] "Input SQANTI Filter output file for Sample K17"
## [1] "SQANTI Classification file of Sample K17"
TOFU_input("K17")
## [1] "Input SQANTI Filter output file for Sample K17"
## [1] "/gpfs/mrc0/projects/Research_Project-MRC148213/sl693/WholeTranscriptome/Individual/ToFU/K17.collapsed.filtered.abundance.txt"
## [1] "Abundance file of Sample K17"
Annotate2Abundance_IsoSeq("K17",classification_dat,abundance_dat)
## [1] "Merged file of SQANTI Classification and Abundance File of Sample K17"

Step 2) IsoSeq Preparation: SumFLCounts

Define function that the FL Counts for all transcripts per gene

Motivation: SQANTI Filter classification outputs one gene with multiple isoforms, thus complicates correlation with RNA-Seq Gene Expression Counts. PacBio FL count is presented per isoform rather than per gene. However, FeatureCount’s output from RNA-seq data is on a gene level. Therefore FL counts from IsoSeq needs to be summed for more convenient comparison: Total FL Counts of Transcripts per Gene from IsoSeq vs Raw Gene Counts from RNASeq

Alternative option: select only isoform with the highest number of FL counts, yet biased results especially given if many isoforms with similar or slgihtly smaller number of FL-counts. Assumptions: RNA-seq captures expression of all RNA transcripts irrespective of isoforms

SumFLCounts(Merge_IsoSeq)

Step 3)RNASeq Preparation

Input: FeatureCounts of all RNASeq samples (STAR Aligned to mm10 genome, and annotated to Gencode Mouse V20 gtf file) at gene level Output: FeatureCount of specific sample

Input_RNAseq("K17")
## [1] "Input FeatureCount for All Samples"
##                             B21 C21 K17 K23 M21 O23 Q21 S23
## ENSMUSG00000000001.4_Gnai3  761 565 374 523 582 375 418 410
## ENSMUSG00000000003.15_Pbsn    0   0   0   0   0   0   0   0
## ENSMUSG00000000028.15_Cdc45  19  20  20  25  32  24  24  24
## ENSMUSG00000000031.16_H19     1   2   3   2   0   0  12   0
## ENSMUSG00000000037.16_Scml2   7  14  12   6  12   7   4  15
## ENSMUSG00000000049.11_Apoh    3   3   1   4   0   0   3   2
## [1] "Input FeatureCount for Sample K17"
Validation_SumFLCounts("App",Merge_IsoSeq)
## [1] "Validation of summing PacBio FL"
## [1] "Original input data from ToFU Abundance files for the Gene App"
## [1] "Summed PacBio FL count for the Gene App saved as new dataframe for downstream analysis"
## [[1]]
##      associated_gene count_fl
## 4731             App        2
## 4732             App        2
## 4733             App        3
## 4734             App        2
## 4735             App        2
## 4736             App       24
## 4737             App        4
## 4738             App        2
## 4739             App        4
## 4740             App        2
## 4741             App        2
## 4742             App       58
## 
## [[2]]
## # A tibble: 1 x 3
##   associated_gene PacBio_Isoform PacBio_FL_Counts
##   <chr>           <fct>                     <int>
## 1 App             PB.3510.14                  107

Step 4) Merge RNASeq and IsoSeq

Input: Sample-specific Isoseq (Dataframe: Merge_IsoSeq_SumFL) and RNASeq (Dataframe: RNASeq) Counts Output: Dataframe “Full_Merge”: Merged Counts across IsoSeq and RNASeq by gene names

Also to call out specific counts of AD-associated genes, created function AD_Counts.

Merge_RNASeq_Isoseq(Merge_IsoSeq_SumFL,RNASeq) # output file name = Full_Merge
AD_Genes <- c("Apoe","App","Mapt","Psen1")
AD_Counts(AD_Genes)

Step 5) Data Review for Full_Merge: RNASeq vs IsoSeq

Motivation: Within Full_Merge dataframe, interested to know which genes are detected only by IsoSeq, only by RNASeq, and alone. Also later downstream, able to plot the number of respective counts for these genes.

Missing_Reads_Review()
## [1] "Total Number of Genes in Full_Merge of IsoSeq and RNASeq: 17375"
## [1] "Total Number of Genes Detected in IsoSeq AND RNASeq: 8576"
## [1] "Total Number of Genes Detected in IsoSeq but not RNASeq: 134"
## [1] "Total Number of Genes Detected in RNASeq but not IsoSeq: 8665"

Step 6) Correlation

output: Correlation of Gene Expression of IsoSeq FL Counts vs RNASeq Raw Counts. Correlation coefficient calculated from pearson’s method (assuming parametric) and considers

Run_Corplot(Full_Merge)

Step 7) Missing Reads

input: Genes either detected by IsoSeq or RNASeq from Full_Merge dataframe (IsoSeq and RNASeq Counts/gene) output: Plot of those genes with its respective counts

Missing_Reads_Plot(Full_Merge)

Missing_Genes(5000)
## [1] "Genes with no IsoSeq Reads but RNASeq RawCounts > 5000"
## [1] "Actb"     "Gm13340"  "Hspa8"    "Ids"      "Kcnj10"   "Mapk8ip3"
## [7] "mt-Nd6"   "Slc12a5" 
## [1] "Genes with only IsoSeq Reads, and no RNASeq Reads"
##   [1] "1700028I16Rik"        "1700047I17Rik2"       "1810009A15Rik"       
##   [4] "2900079G21Rik"        "3110021N24Rik"        "A930015D03Rik"       
##   [7] "Aarsd1"               "AC110573.1"           "AC122413.2"          
##  [10] "AC164314.1"           "Akap2"                "AL731706.1"          
##  [13] "Atoh7"                "Atp6v0a4"             "B230377A18Rik"       
##  [16] "B3gnt2"               "C1qtnf5"              "Cdk1"                
##  [19] "Cebpd"                "Chad"                 "Commd5"              
##  [22] "Cox20"                "Cpne1"                "Dpys"                
##  [25] "Entpd4"               "Entpd4b"              "Epo"                 
##  [28] "Exosc6"               "Fam177a"              "Fdx1l"               
##  [31] "Fen1"                 "Gal3st4"              "Galnt2"              
##  [34] "Gm10108"              "Gm10177"              "Gm11847"             
##  [37] "Gm13166"              "Gm13370"              "Gm14022"             
##  [40] "Gm14391"              "Gm14435"              "Gm15387"             
##  [43] "Gm19412"              "Gm20388"              "Gm20431"             
##  [46] "Gm20479"              "Gm20662"              "Gm20878"             
##  [49] "Gm26561"              "Gm26745"              "Gm26786"             
##  [52] "Gm26904"              "Gm28052"              "Gm28374"             
##  [55] "Gm28635"              "Gm29253"              "Gm3448"              
##  [58] "Gm3591"               "Gm3667"               "Gm42418"             
##  [61] "Gm42466"              "Gm42742"              "Gm42936"             
##  [64] "Gm43552"              "Gm44618"              "Gm45208"             
##  [67] "Gm45213"              "Gm45837"              "Gm49032"             
##  [70] "Gm49369"              "Gm49539"              "Gm5641"              
##  [73] "Gm8281"               "Gnpnat1"              "Gpr25"               
##  [76] "H2-Ke6"               "Hgs"                  "Hist1h2ap"           
##  [79] "Hist2h2aa1"           "Hpdl"                 "Hrct1"               
##  [82] "Hsd3b7"               "Ikzf5"                "Ilk"                 
##  [85] "Ing4"                 "Lrch4"                "Med10"               
##  [88] "Moap1"                "Mrpl53"               "Nanp"                
##  [91] "Nme3"                 "novelGene_113"        "novelGene_29"        
##  [94] "novelGene_31"         "novelGene_60"         "novelGene_65"        
##  [97] "novelGene_68"         "novelGene_78"         "novelGene_Anp32a_AS" 
## [100] "novelGene_Aven_AS"    "novelGene_Dcaf5_AS"   "novelGene_Gm14597_AS"
## [103] "novelGene_Igsf21_AS"  "novelGene_Pitpnm2_AS" "novelGene_Stx3_AS"   
## [106] "Npcd"                 "Nrtn"                 "Nudt8"               
## [109] "Pdf"                  "Pet117"               "Pgk1-rs7"            
## [112] "Prrt2"                "Ptp4a1"               "Rab43"               
## [115] "Rangrf"               "Raver1"               "Rbm42"               
## [118] "Rhbdl1"               "Rnaset2a"             "Rnf26"               
## [121] "Rpl15-ps3"            "Rpl27-ps3"            "S100a3"              
## [124] "Scamp4"               "Shisa8"               "Siglece"             
## [127] "Smco3"                "Sp5"                  "Tmem254b"            
## [130] "Tmsb15b1"             "Tomm6"                "U2af1l4"             
## [133] "Vps25"                "Wdr74"